Algorithms for Molecular Biology — Latest Matching Preprints

1

Towards a Unified Exact Solution of Rearrangement Small Parsimony for Natural Genomes

Bohnenkaemper, L.; Frolova, D.

2026-06-28 bioinformatics 10.64898/2026.06.23.733974 medRxiv

Top 0.1%

3.3%

Show abstract

Phylogenetic reconstruction is a fundamental problem in comparative genomics. As a theoretical problem in rearrangement studies, this has been modelled as the Small Parsimony Problem (SPP), in which ancestral genome structures have to be determined minimizing the number of rearrangement events occurring throughout the phylogeny. This problem is of significant interest in microbial and cancer genomics, due to the prevalence and clinical importance of rearrangement events. Genome structures in this problem are expressed as sequences of markers, which are themselves oriented sequence features (such as genes) that abstract from non-structural variations. Recent research has focused on the problem under the natural genomes model, in which arbitrary variations in copy number of markers are allowed. Natural genomes are often studied under the DCJ-indel model, a model which has already been successfully applied to plasmid data. There also exist ILP solutions to a variant of the Small Parsimony Problem under the DCJ-indel model. However, these solutions are limited in their applicability, as they make some critical simplifications for tractability purposes: ancestral marker frequencies and precomputed putative ancestral adjancencies, with their predicted likelihoods, are assumed as input. This creates multiple problems from both a theoretical and practical perspective. Firstly, this simplification means that not the full state space is searched for a solution, but rather only the subset of genomes with the precomputed putative adjacencies, meaning an optimal solution to the exact SPP is not guaranteed. Secondly, marker frequencies are given externally, without any theoretical guarantees. Thirdly, the method used to precompute adjacencies relies on gene trees, which requires the use of genes as markers, when gene annotation is often unreliable, especially in regions with a lot of rearrangement. Additionally, this restricts the applicability of the approach to sets of genomes that are both divergent and large enough to be able to produce informative gene trees. This is, for example, rarely the case for plasmids, where nucleotide mutations are rarer than rearrangements and genomes are small. Hence, we revisit the problem to solve the exact SPP by introducing a cost to indel operations, which allows us to compute ranges of marker frequencies and derive theoretical results, that allow us to reduce the solution space that the ILP searches without sacrificing optimality. We show that this makes the problem tractable for the case of small and recently related genomes, first on simulated genomes, and then on a set of pathogenic plasmids which represent a realistic use case for the method.

2

Accelerating String Comparison in RLZ Compressed Sequences via LCE Jumps

Varki, R.; Boucher, C.

2026-06-16 bioinformatics 10.64898/2026.06.11.731742 medRxiv

Top 0.1%

3.2%

Show abstract

Relative Lempel-Ziv (RLZ) is an effective compression method for large, repetitive collections; however, the fundamental primitives required to elevate it from a passive archival format to a tractable representation for compressed construction have yet to be fully established. In this paper, we introduce an algorithmic framework for structurally comparing and lexicographically sorting sequences of RLZ factors. We characterize when direct factor comparisons are necessary and when they can be bypassed using RLZ specific shortcuts. We further introduce a method for extending truncated factors into right-maximal matches, enabling the recovery of matching statistics from the RLZ parse. Experimentally, RLZ sorting achieved speedups of up to 3.93x over character-based sorting. Together, these results advance the use of the RLZ format as a foundation for compressed construction.

3

Fast Set Operations for Compact k-mer Sets

Alanko, J.; Depuydt, L.; MARCHET, C.; Puglisi, S. J.

2026-05-27 bioinformatics 10.64898/2026.05.24.727514 medRxiv

Top 0.1%

3.2%

Show abstract

The k-mer spectrum of a set of sequences is the set of k-length substrings the sequences contain. This lossy representation of sequence content pervades modern genomics. Recently, the spectral Burrows-Wheeler transform (SBWT) has emerged as a space-efficient representation of k-spectra that also supports efficient k-mer lookup queries and, more generally, easy navigation of the de Bruijn graph of the k-spectrum. In this paper, we examine primitive set operations, such as intersection, union, and set difference, on SBWT-encoded k-spectra and show that these operations can be supported efficiently. Moreover, efficient merging leads directly to a new memory-efficient algorithm for SBWT construction, which was able to build the SBWT for the 661K bacterial dataset containing 88 billion distinct k-mers in 50 hours using 186 GiB of RAM and 112 GiB of disk space. Given the pervasiveness of k-mer sets in genomics and the continued rapid growth of genomic databases, our work opens the door to a wide array of future applications that manipulate and reason about genomic data by dealing directly with simultaneously compact and searchable k-mer set representations offered by the SBWT. 2012 ACM Subject ClassificationTheory of computation [->] Design and analysis of algorithms Digital Object Identifier10.4230/LIPIcs.WABI.2026. Supplementary MaterialSoftware (Source Code): https://github.com/LoreDepuydt/sbwt-set-operations FundingThis work has benefited from funding from the French State under the France 2030 program, reference ANR-21-IDES-0006. The European Metropolis of Lille and the University of Lille are also acknowledged for their funding and support of the project WILL-CHAIRES-25-001-BOSSA.

4

DynamicDemiLog: A Single Sketch for Ultrafast Similarity, Frequency, and Cardinality Estimation

Bushnell, B. J.

2026-06-16 bioinformatics 10.64898/2026.06.12.731986 medRxiv

Top 0.1%

2.8%

Show abstract

Probabilistic cardinality estimators (HyperLogLog), similarity sketches (MinHash), and frequency estimators (Count-Min Sketch) are fundamental approximate data structures that each target one primary problem. We present DynamicDemiLog (DDL), a sketch that unifies cardinality estimation, set similarity, containment, element frequency and composition in one tiny data structure built from a single pass over the input stream. Using an inverted index over 200,687 RefSeq sketches (159,567 organisms), DDL performs all-to-all sketch similarity comparison of the full database in 30 seconds (128 threads, indexed) -- over 375x faster per query than Mashs brute-force all-to-all comparison of 91,282 sketches, or 31x faster without the index, at double the sketch resolution. DDL extends the LogLog register with a mantissa: each register stores a floating-point-encoded hash value consisting of an integer exponent (the leading-zero count) and a fractional mantissa (the sub-leading-zero bits), rather than the integer leading-zero count alone. This preserves enough hash information for meaningful register-by-register comparison -- a property that standard 6-bit registers lack -- while improving on LogLogs cardinality estimation machinery, including DynamicLogLogs early exit mask for high-throughput streaming. With a default 10 mantissa bits (16-bit registers, 2,048 buckets, 4 KB), DDL achieves a per-register false-match rate of 0.018% on unrelated random same-size sets (compared to 17.0% for LL6, a basic HyperLogLog implementation), enabling Weighted Kmer Identity (WKID), Average Nucleotide Identity (ANI), containment, and completeness estimation from register comparison alone. A 16-bit per-register observation counter provides element frequency information at trivial additional computation cost, and an additional byte tracks element composition (GC content, for biological data). Furthermore, DDLs high-specificity registers enable an inverted index structure (DDLIndex) that answers similarity queries against a database of N sketches in O(B + M) time, where M is the number of matching index entries, compared to O(N xB) for pairwise comparison. DDL achieves a 930x reduction in false register matches compared to LL6 (Section 11.1), accurately estimates ANI between full and partial genomes down to approximately 79% identity (at k=25, B=2,048), and maintains near-zero spurious similarity on unrelated inputs -- all at similar construction speed to LL6, and 3.5x faster than SetSketch.

5

CoSTAR: Coarse Stem-Topology Alignment of Pseudoknotted RNA Structures by Relation-Constrained Search

Archinuk, F.; Jabbari, H.

2026-05-26 bioinformatics 10.64898/2026.05.22.727263 medRxiv

Top 0.1%

2.6%

Show abstract

RNA structural alignment is a central task in comparative RNA analysis, but many efficient methods achieve tractability by restricting the class of admissible structures, often excluding pseudoknots. This exclusion is limiting for viral and regulatory RNAs, where conserved structure can remain informative even when sequence conservation is weak. We introduce a coarse RNA structural alignment algorithm that aligns secondary structures by searching over partial maps between stems rather than nucleotides. Each input structure is decomposed into stems, annotated with nucleotide-level features, and encoded by pairwise topological relations among stems. Alignment is formulated as a cost-minimizing partial stem map with skip operations, and the search tree is pruned by RNA-specific directionality and topological constraints derived from already aligned stems. For the stated cost function and over the class of injective, direction-preserving, topologically consistent stem maps, the search is exact. This shifts the dominant computational dependence from sequence length to the number and arrangement of stems. We evaluated the method on 2100 pairwise alignments sampled from seven Rfam families spanning 40-224 nucleotides and 2-15 stems. Across these benchmarks, the algorithm returned terminal coarse alignments in which every stem was either matched or skipped. We measured running time and search-tree width to characterize performance on diverse family-to-family comparisons. The experiments also show that ordering the input structures affects efficiency: using the structure with more stems as the search-driving structure reduces tree width. The resulting partial stem map is directly interpretable for RNA annotation and can be projected to nucleotide resolution for downstream sequence-structure analysis. The source code for CoSTAR is available at: https://github.com/TheCOBRALab/CoSTAR

6

DNA Compression with Genomic Language Models: Tokenization, Benchmarking, and an Information-Content Map

Macala, V.; Simecek, P.

2026-06-12 bioinformatics 10.64898/2026.06.10.731316 medRxiv

Top 0.1%

2.5%

Show abstract

Lossless compression and probabilistic sequence modeling are two faces of the same coin: a model that assigns high probability to a sequence can encode it in few bits via arithmetic coding. We exploit this duality to evaluate genomic language models as compressors of DNA, using compression primarily as an objective probe of generative sequence modeling rather than as a deployable storage system. We release DNAGPT2, a family of ten GPT-2-small models pretrained for one epoch on a single A40 using the DNABERT2 multi-species corpus that differ only in byte-pair encoding vocabulary size. Coupled with arithmetic coding, the best model reaches 1.47 bits per base (bpb) on the T2T human genome, fourth in the Cobilab compression benchmark and ahead of every general-purpose compressor. Our results suggest that NLP-style tokenization choices may be suboptimal for DNA: a 32-token BPE vocabulary compresses better than larger vocabularies. We also find that, in this benchmark, published long-context genomic LMs underperform a much shorter-context BPE GPT-2; we discuss in Section 5 that this is not a controlled context-length ablation, since the compared models also differ in architecture, training data, parameter count, and tokenization. Finally, we compute a per-nucleotide information-content map of the human genome and show that exons, introns, intergenic regions, and Alu repeats have statistically distinct information profiles.

7

Reachability-Preserving Minimum Edge Cut Problem and Applications in Biology

Xie, J.; Duan, Q.

2026-06-03 bioinformatics 10.64898/2026.06.01.729192 medRxiv

Top 0.1%

2.1%

Show abstract

Biological pathway analysis often requires identifying interventions that block reachability to an undesirable state, such as a disease-associated module, toxic byproduct, or adverse phenotype, while preserving reachability among essential biological functions. Motivated by this setting, we study the Reachability Preserving Minimum Edge Cut (RPMEC) problem: given protected terminals s1 and s2 and a target terminal t, the goal is to remove a minimum-cost set of edges that separates s1 and s2 from t while keeping s1 and s2 connected. This formulation naturally models pathway-level intervention design, where one seeks to disrupt harmful signaling, metabolic, or interaction routes without breaking required functional connectivity. We revisit the three-terminal undirected edge-cut case and analyze a Dijkstra-style dynamic programming algorithm that is exact on planar graphs but fails on general graphs. We characterize the structural condition required for exactness, namely frontier-realizability of optimal source-side regions, and identify biological graph representations where this condition is likely to hold after appropriate preprocessing, including curated planar pathway maps, Reactome-style hierarchy trees, SCC-contracted feedback modules, metabolic building-block DAGs with dominator structure, and functional-module quotients of protein interaction networks. We further discuss directed variants, approximation strategies, and exact alternatives based on ASP, MILP, bounded-treewidth dynamic programming, and important separators. The results provide a graph-theoretic foundation for deciding when fast greedy computation is reliable for biological pathway intervention problems and when more expressive exact optimization methods are needed. Author SummaryMany real-world networks require interventions that separate harmful or undesirable states while preserving essential connectivity. This situation appears in biological pathway analysis, where one may want to block reachability to a disease-related module, toxic byproduct, or adverse phenotype without disrupting communication among essential genes, proteins, reactions, or metabolites. We study this problem through the Reachability Preserving Minimum Edge Cut formulation. Unlike ordinary minimum cut, the solution must satisfy both a separation requirement and a preservation requirement. We show why a natural Dijkstra-style algorithm works only under specific structural conditions, such as planar, laminar, or module-like pathway graphs, and why it may fail on general graphs. The results help identify when fast graph algorithms are reliable for biological intervention problems and when exact optimization tools such as Answer Set Programming or integer programming are more appropriate.

8

Min-frame transformation enables more sensitive viral genome alignment

Doughty, R. D.; Banerjee, A.; Kille, B.; Warnow, T.; Treangen, T. J.

2026-05-22 bioinformatics 10.64898/2026.05.20.726535 medRxiv

Top 0.1%

2.1%

Show abstract

MotivationMaximal unique matches (MUMs) are a fundamental primitive in genome comparison, where they serve as high-confidence anchors for downstream multiple genome alignment. However, because MUMs rely on exact string matching, their effectiveness degrades with increased genome divergence and larger sets of genomes, inhibiting their ability to recover long homologous regions and reducing the number of base pairs covered by the multiple genome alignment. Additionally, existing approaches that improve robustness to mutation, such as spaced seeds or translated alignment methods, introduce trade-offs in specificity, scalability, or computational complexity. MethodsTo address this gap, we introduce the Min-Frame Transformation (MFT), a deterministic encoding of nucleotide sequences to sequences over a transformed alphabet that preserves the coordinate structure of the original sequence. At each position, the MFT selects a k-mer from a local window according to a fixed global ordering and assigns it a character in the transformed alphabet via a predefined mapping. This process captures local sequence context and can mask the impact of mutations, increasing the likelihood that homologous regions remain detectable as exact matches. The resulting transformed sequences can be indexed using standard string data structures, such as suffix arrays and suffix trees, enabling efficient extraction of MUMs without modifying existing algorithms. ImpactThe MFT is a novel computational approach for improving the robustness of MUM-based seeding for genome alignment by producing longer and more contiguous matches that span a greater fraction of the genome, leading to improved alignment coverage and SNP recall. Altogether, these improvements have the potential to result in improvements for downstream viral genome analysis applications such as phylogenetic inference and transmission analysis. FundingTandy Warnow: NSF grant 2316233 Todd J. Treangen: NSF grants 2126387, 2239114, NIH grants U19-AI144297, P01-AI152999

9

Insertions, deletions, and exchangeable couplings: a Dirichlet process over TKF92 domains and sites

Large, A. L.; Holmes, I. H.

2026-05-19 bioinformatics 10.64898/2026.05.16.725674 medRxiv

Top 0.1%

2.1%

Show abstract

The TKF92 model of molecular evolution--a linear birth-death process for indels, with finite-state continuous-time Markov chain substitutions--is exchangeable in residue identity at every site: the generative process treats amino acids symmetrically, conditional on a single substitution rate matrix. To introduce local heterogeneity, evolutionary models are often equipped with site-class mixtures, preserving this symmetry in the sense of de Finetti: conditional on the latent class, residues are still exchangeable. In a four-step theoretical ladder, we show how long-range structure such as couplings between distant sites can also be introduced exchangeably by using a Dirichlet process to partition sites into co-evolving classes. Our first step is a thorough analysis of TKF92 to establish sufficient statistics, limiting behavior, and inferential tools. We then lift the pairwise TKF92 hidden Markov model, in the limit of small time, to a time-indexed gravestone-augmented pair stochastic context-free grammar, and from there to its phylogenetic generalisation. This framing allows trajectories to be sampled exactly by Inside-Outside recursion. The third step places a Dirichlet process over the alive sites and asks co-keyed sites to evolve under a sparse Potts interaction -- an exchangeably-partitioned hidden direct-coupling model whose marginal alignment likelihood is unchanged from plain TKF92. The fourth rung of the ladder develops inference machinery: a Gibbs-Metropolis sampler that alternates alignment resamples, key-partition resamples, and stochastic parameter updates. We close several gaps along the way -- exact closed-form sufficient statistics for the linear birth-death-immigration component, the resolvable LHopital limit at{lambda} =, and a closed-form M-step for a recursive generalisation of TKF92 -- and we report a 1,000-family Pfam fit with K=4 site classes whose Potts atoms carry [~]0.54 nats of covariation per class-pair on top of a substantial single-site substitution model. Supplementary material, including full source code for inference, may be found at https://tkfdp.net/.

10

Quantifying and Predicting the Difficulty of Multiple Sequence Alignment with AlDiScore

Bodynek, M.; Martin-Fernandez, L.; Bettisworth, B.; Haag, J.; Stamatakis, A.

2026-06-02 bioinformatics 10.64898/2026.05.29.727837 medRxiv

Top 0.1%

1.9%

Show abstract

Multiple Sequence Alignment (MSA) constitutes an important and frequent operation in molecular sequence data analysis. There exist numerous tools, algorithms, and criteria to infer an MSA. This plethora of available approaches to MSA may induced an ensemble of divergent MSAs for the same underlying unaligned sequence set. Even a single MSA tool may infer distinct MSAs when varying the input parameters. Hence, when using a diversified set of MSA algorithms and parameterizations, the observed dispersion within an MSA ensemble expresses the difficulty of inferring a robust alignment. We refer to this notion as MSA difficulty. As downstream analyses heavily rely on the MSA, characterizing MSA difficulty for a given unaligned sequence set is critical. Initially, we show that measures of dispersion within diversified MSA ensembles can reliably predict MSA difficulty. We then assess the adequacy of these measures by computing the average reference-based distance between the MSAs in the MSA ensemble and its corresponding structural reference MSA and subsequently comparing this distance to the corresponding reference-free average distance over all MSA pairs in the ensemble. We find that Blackburne and Whelans dpos alignment metric is most appropriate as its reference-free [Formula] counterpart most accurately approximates the reference-based difficulty computed on BAliBASE reference data. We therefore use [Formula] to quantify MSA difficulty on a scale from 0 (easy) to 1 (difficult). Next, we introduce the AlDiScore open-source tool, which uses machine learning to directly and reliably predict reference-free difficulty scores from unaligned sequence sets to completely omit expensive MSA computations. The underlying regression model relies upon a large set of features, including sampling-based measures of transitive consistency. We trained our AlDiScore model on a diverse collection of empirical datasets from BAliBASE, TreeBASE, and published studies. Subsequently, we demonstrate that AlDiScore attains an R2 of 0.89 and of 0.84 on unseen AA and DNA sequence sets extracted from the PANDIT v17 database. Finally, we show that there is no correlation between MSA difficulty and the corresponding phylogenetic difficulty of the respective MSA.

11

Statistical inference of the Tree of Blobs of a phylogenetic network from quartet concordance factors

Rhodes, J. A.; Allman, E. S.; Ane, C.; Banos, H.

2026-05-31 evolutionary biology 10.64898/2026.05.28.728501 medRxiv

Top 0.1%

1.9%

Show abstract

A phylogenetic network represents evolutionary relationships involving hybridization, gene flow, or admixture. While the full network may not be identifiable from genomic data under common coalescent models, its tree of blobs, depicting only the tree-like portions of the network structure, is. We introduce ECToBlob (Edge Contraction for Tree of Blobs), a new statistically-consistent algorithm to estimate the tree of blobs from quartet concordance factors. Starting from a resolved tree, ECToBlob successively contracts edges which statistical tests indicate do not belong in the tree of blobs, due to reticulate or polytomous signal. We show that ASTRAL provides a valid starting tree under common assumptions, in that, asymptotically in the number of loci, trees optimizing ASTRALs criterion refine the tree of blobs. We describe several algorithm variants, differing in how evidence from multiple tests are combined to determine if the edge should be contracted, and provide software implementations. Relevance to Life SciencesHybridization, gene flow, or admixture are now recognized as important aspects of evolutionary history, but their genomic signal is confounded with that from a coalescent process, creating substantial challenges for inferring phylogenetic networks. The networks tree of blobs identifies areas where reticulation occurred, separated by tree-like branching. ECToBlob quickly estimates the tree of blobs using quartet concordance factors from gene trees, and provides a measure of statistical support for its result. Performance is illustrated through simulation and on empirical data, using an implementation in the R package MSCquartets. While the presence of a blob may be all that can be inferred in some cases, in others ECToBlob offers a robust and principled way to focus further analyses on more local reticulate structure. Mathematical ContentThis work makes contributions to mathematical phylogenetics in optimization, combinatorics, and statistics. We show that any tree maximizing quartet support (the criterion underlying ASTRAL) is a refinement of the networks tree of blobs under the coalescent model. Second, we give a concise proof that whether a network has a cut-edge corresponding to a given split is determined by information in certain subcollections of its 4-taxon subnetworks (quarnets). Finally, we propose valid statistical approaches for combining p-values across multiple quarnet hypothesis tests, proving that their use with specific decreasing test levels leads to statistically consistent inference as the number of loci grows. MSC codes05C90, 60J95, 62-04, 62F07, 92D15

12

Lossless compression of k-mer matrices enabling random row access

Regnier, A.; Lemane, T.; Bellenous, S.; Chikhi, R.; Peterlongo, P.

2026-07-08 bioinformatics 10.64898/2026.07.03.736306 medRxiv

Top 0.1%

1.9%

Show abstract

Genomic search engines such as Logan-Search index petabytes of sequencing data as large binary matrices, called k-mer matrices, where each row encodes the presence of a k-mer across thousands to millions of genomic samples. Logan-Search contains a petabyte of binary matrices, and storing them is expensive, yet compression must not prevent fast random access to any matrix row at query time. We present kmcomp, a lossless compression method for k-mer matrices that satisfies these competing requirements. Block compression partitions the matrix into fixed-size row blocks, each compressed independently; block start positions are stored in an Elias-Fano encoded array, enabling O(1) random access to any block. To improve compressibility without introducing additional decompression steps, we introduce the {pi}-compression: a column reordering that groups similar samples together by solving the Traveling Salesman Problem via a nearest-neighbor heuristic. We accelerate this heuristic with a novel variant of the vantage-point tree, the masked vp-tree, which dynamically prunes nearest-neighbor search space. On three (meta)genomic datasets, kmcomp achieves compression ratios of 1.3 to 5.4; {pi}-compression further improves these to 1.5 to 51.3. Applied to the Logan-Search petabyte-scale index, compression reduces storage by approximately half, and {pi}-compression adds a further 13% gain. Query overhead remains modest: queries of hundreds of nucleotides incur an absolute latency increase of {approx} 100 ms, and highly compressed indexes can match uncompressed query times thanks to reduced disk reads.

13

SwiftNJ: Fast Exact Neighbour Joining via Correctness-Gated Coding Agents

Christensen, J.

2026-05-29 bioinformatics 10.64898/2026.05.28.728410 medRxiv

Top 0.1%

1.7%

Show abstract

The capability profile of frontier coding agents in 2026 varies sharply across technical domains, motivating domain-specific empirical study of where, and under what oversight conditions, such systems can contribute to specialised technical work. This paper presents one such study in computational phylogenetics. Neighbour joining (NJ) is a widely used distance-based method for inferring evolutionary trees in microbial epidemiology, comparative genomics, and large-scale sequence clustering. Its constant-factor runtime is set by hand-tuned native implementations; RapidNJ is a widely-cited representative of that class and serves here as the comparison baseline. We ask whether a current-generation coding agent, operating under a correctness-gated optimisation harness with deterministic correctness gates calibrated against a QuickTree reference, can advance that constant factor on a fixed benchmark. The resulting implementation, SwiftNJ, achieves a geometric-mean runtime ratio of 0.565 against a locally-rebuilt RapidNJ-native binary across a 59-matrix corpus, sub-parity on 58 of 59 matrices. On 400 shuffled inputs drawn from 16 small matrices (n [≤] 2000), SwiftNJ matched the QuickTree reference at Robinson-Foulds distance zero. In this domain, a correctness-gated coding agent meaningfully improved on a strong native baseline, suggesting that harness-guided optimisation holds promise for performance-critical bioinformatics tools; further work is needed to establish how broadly the approach generalises.

14

Beyond infinite sites: Generalized ABBA-BABA statistic for deeper phylogenies

Zhang, C.; Nielsen, R.

2026-07-08 bioinformatics 10.64898/2026.07.06.736715 medRxiv

Top 0.1%

1.7%

Show abstract

The Patterson's D statistic detects gene flow from ABBA-BABA site patterns, but its biallelic site patterns fail under deeper divergences where multiple hits cause false positives. We propose two extensions, D+ and D*. Both incorporate multiallelic site patterns to reduce saturation bias under JC and F84 model. Simulations show that D+ and D* both remain correctly null under all conditions and detect gene flow effectively, with distinct advantages: D+ guarantees non-negativity of the denominator, while D* provides greater robustness when mutation rates vary across genomic regions. The source code and binary files are publicly available at https://github.com/chaoszhang/ASTER.

15

Binary search and and set operations on compacted k-mer lists

Dufresne, Y.; Andreace, F.

2026-07-03 bioinformatics 10.64898/2026.06.29.735436 medRxiv

Top 0.1%

1.3%

Show abstract

Sorted lists of elements are particularly good for computing set operations. A single scan of the two lists is sufficient to materialize or count the results of the union, intersection, difference, and xor operators. In bioinformatics, only a few tools are designed to perform these operations on k-mers. A fast tool like KMC allows set operations at the cost of storing individual k-mers. In this paper, we introduce a novel way to represent sorted k-mers as a collection of recomposed super-k-mer sorted lists. We introduce the concept of virtual super-k-mer and show how to construct, query and perform set operations on sorted lists of virtual super-k-mers. In the implementation sklib, we demonstrate high throughput of the data structure for construction and set operations, while remaining competitive in query capabilities, within a controlled memory footprint (2-5x decrease in bits/element compared to KMC).

16

Memory-safe high-performance sequence mapping with rammap

Wang, J. R.; Li, H.

2026-05-29 bioinformatics 10.64898/2026.05.26.726289 medRxiv

Top 0.2%

1.1%

Show abstract

We introduce a reimplementation of the widely used mapping tool minimap2 in Rust called rammap. We demonstrate perfect concordance with minimap2, enabling its backwards compatibility as a drop-in replacement for minimap2-based workflows. Additionally, rammap implements performance optimizations for modern architectures and applications, including AVX512 and WASM v128 SIMD support for dynamic programming alignment and SIMD-accelerated chaining. These achieve comparable or better performance than minimap2 across diverse mapping workloads while maintaining Rusts stronger memory safety constraints. The rammap API exposes both SIMD-accelerated sequence-sequence alignment modules and full mapping pipelines for use as an integrated library. Lastly, we describe the modular architecture and provide examples illustrating the extensibility of major mapping components, including seeding, chaining, and gap-filling/extension to support development of improved or domain-specific mapping components within a high-performance framework.

17

linearPOA: A parallel, memory-efficient framework for Partial Order Alignment with linear space complexity

Wei, Y.; Huang, Z.; Zhang, P.; Tian, Q.; Li, Y.; Zou, Q.; Yu, L.

2026-04-30 bioinformatics 10.64898/2026.04.27.720899 medRxiv

Top 0.2%

1.0%

Show abstract

Multiple sequence alignment (MSA) is a fundamental problem in computational bioinformatics, playing a critical role in genome biology, especially in long read sequencing and assembly. One solution for representing and solving MSA is Partial Order Alignment (POA), which employs Directed Acyclic Graphs (DAGs) to represent sequence relationships. However, when facing the ultra-long, error-prone reads (e.g., >100 kbps), existing POA algorithms with quadratic space complexity become impractical due to excessive memory consumption. This paper introduces the linearPOA, which based on divide-and-conquer strategy to solve the POA, aimed at saving memory compared to quadratic space complexity algorithms like SPOA, abPOA and TSTA. Particularly notable is its capability to save up to 102.74 times memory usage when aligning sequences with 100 kbp reads, compared to the abPOA method using non-heuristic methods. The algorithm was implemented within the linearPOA library, providing functionality for POA and foundational support for sequencing analysis, like error correction for reads. The linearPOA algorithm provides memory-efficient algorithms for long-read sequencing, especially in directly assembling long reads like 100 kbp reads. AvailabilityThe linearPOA library is freely available at https://github.com/malabz/linearPOA, and the data underlying this article are available in Zenodo, at https://doi.org/10.5281/zenodo.15637837. Supplementary informationSupplementary information are available at BioRxiv online.

18

Navigating the pangenome coordinate system with Shredtools

Shivakumar, V. S.; Langmead, B.

2026-07-08 bioinformatics 10.64898/2026.07.03.736354 medRxiv

Top 0.2%

1.0%

Show abstract

Existing notions of pangenome coordinates rely on hard-to-compute multiple sequence alignments. On the other hand, pangenome-wide exact unique matches (multi-MUMs) can be computed efficiently, and represent conserved stretches of columns in the underlying MSA. We introduce Shredtools, which uses multi-MUMs as pangenome waypoints and allows for sophisticated queries in pangenome coordinates. Its primary query is extract, which takes an interval of one sequence and extracts the smallest window containing it that is syntenic pangenome-wide. Shredtools' extract query can extract a gene region from 476 human genomes in half a second. Other queries help to refine these results, by finding local exact matches to improve the density of multi-MUM coverage ("enhance") and by selectively discarding sequences to improve the precision of the syntenic region ("zoom"). The Shredtools web interface (available at https://vikshiv.github.io/shredtools) allows for client-side handling of extract queries with index queries handled via simple and fast HTTP Range requests, simplifying usage and enabling pangenome-scale discoveries.

19

Efficient and Tidy Manipulation of Annotated Matrix Data with plyxp

Landis, J. T.; Love, M. I.

2026-05-11 bioinformatics 10.64898/2026.05.06.721669 medRxiv

Top 0.2%

0.9%

Show abstract

Manipulating high-dimensional omics data, such as bulk or single cell gene expression counts matrices, typically requires a bioinformatics analyst to learn domain-specific functions and syntax. These matrix-centric functions and syntax can be less intuitive than working with tidy data analytic principles, as exemplified by tools such as dplyr applied to tabular data. We propose an expressive grammar for manipulating annotated matrix data, with syntax to access, modify, and append matrix data and tabular row and column metadata, including row-wise or columnwise grouped operations. This grammar defines multiple contexts, and providing pronouns for specific recall and assignment within and across these contexts. The plyxp package is an implementation of this grammar for the R/Bioconductor ecosystem, with efficient abstractions for the SummarizedExperiment class. We demonstrate plyxps efficiency compared to alternative approaches on data manipulation tasks requiring computation across contexts.

20

DistPCA: Tera-Scale Genomic PCA via Out-of-Core Distributed Parallelism

Mermigkis, G.; Sofotasios, A.; Kontopoulou, E.-M.; Gallopoulos, E.; Hadjidoukas, P.

2026-05-19 bioinformatics 10.64898/2026.05.15.725487 medRxiv

Top 0.2%

0.9%

Show abstract

Principal Component Analysis (PCA) is a fundamental tool in human genetics, widely used to study population structure. However, the rapid growth of modern genomic datasets, which often exceed main memory capacity, renders traditional PCA methods infeasible, motivating out-of-core approaches. Prior work on out-of-core genomic PCA has focused primarily on optimizing the inherently compute-intensive numerical core, largely overlooking the stages of data I/O and preprocessing, which emerge as significant performance bottlenecks at tera-scale. Furthermore, existing approaches remain limited to shared-memory single-node architectures, lacking support for distributed multi-node environments. To address these limitations, we introduce DistPCA, the first distributed out-of-core framework for tera-scale genomic PCA, implemented as a C++ library and scalable across both single- and multi-node systems. Built on top of Message Passage Interface (MPI), the proposed framework employs multi-level data parallelism across the entire PCA pipeline, combining multiprocessing, multithreading, SIMD vectorization, and compute-transfer overlap, while remaining compatible with block-based methods that rely on associative operations. Extensive evaluation on real and synthetic datasets demonstrates near-linear scalability, achieving speedups of up to 58.2x and over 98% reduction in wall-clock time, while maintaining parallel efficiency above 82% and preserving accuracy in the recovered principal components.